43 research outputs found

    Quartets Enable Statistically Consistent Estimation of Cell Lineage Trees Under an Unbiased Error and Missingness Model (Abstract)

    Get PDF

    Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL

    Get PDF
    Here we introduce the Optimal Tree Completion Problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. More formally, given a pair of unrooted binary trees (T,t) where T has leaf set S and t has leaf set R, a subset of S, we wish to add all the leaves from S R to t so as to produce a new tree t\u27 on leaf set S that has the minimum distance to T. We show that when the distance is defined by the Robinson-Foulds (RF) distance, an optimal solution can be found in polynomial time. We also present OCTAL, an algorithm that solves this RF Optimal Tree Completion Problem exactly in quadratic time. We report on a simulation study where we complete estimated gene trees using a reference tree that is based on a species tree estimated from a multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach, but the accuracy of the completed gene trees computed by OCTAL depends on how topologically similar the estimated species tree is to the true gene tree. Hence, under conditions with relatively low gene tree heterogeneity, OCTAL can be used to provide highly accurate completions of estimated gene trees. We close with a discussion of future research

    TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees

    Get PDF
    Gene tree correction aims to improve the accuracy of a gene tree by using computational techniques along with a reference tree (and in some cases available sequence data). It is an active area of research when dealing with gene tree heterogeneity due to duplication and loss (GDL). Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to incomplete lineage sorting (ILS, a common problem in eukaryotic phylogenetics) and horizontal gene transfer (HGT, a common problem in bacterial phylogenetics). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-Optimal Tree Refinement and Completion Problem, which seeks a refinement and completion of an input tree t with respect to a given binary tree T so as to minimize the Robinson-Foulds (RF) distance. We present the results of an extensive simulation study evaluating TRACTION within gene tree correction pipelines on 68,000 estimated gene trees, using estimated species trees as reference trees. We explore accuracy under conditions with varying levels of gene tree heterogeneity due to ILS and HGT. We show that TRACTION matches or improves the accuracy of well-established methods from the GDL literature under conditions with HGT and ILS, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. TRACTION is available at https://github.com/pranjalv123/TRACTION-RF and the study datasets are available at https://doi.org/10.13012/B2IDB-1747658_V1

    Leveraging Constraints Plus Dynamic Programming for the Large Dollo Parsimony Problem

    Get PDF
    The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for classic parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem in polynomial time for the Dollo criterion score. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. Thus, we implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility in the context of species phylogenetics using both simulated and real data sets. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony

    Detectability of Varied Hybridization Scenarios using Genome-Scale Hybrid Detection Methods

    Full text link
    Hybridization events complicate the accurate reconstruction of phylogenies, as they lead to patterns of genetic heritability that are unexpected under traditional, bifurcating models of species trees. This has led to the development of methods to infer these varied hybridization events, both methods that reconstruct networks directly, and summary methods that predict individual hybridization events. However, a lack of empirical comparisons between methods - especially pertaining to large networks with varied hybridization scenarios - hinders their practical use. Here, we provide a comprehensive review of popular summary methods: TICR, MSCquartets, HyDe, Patterson's D-Statistic (ABBA-BABA), D3, and Dp. TICR and MSCquartets are based on quartet concordance factors gathered from gene tree topologies and Patterson's D-Statistic, D3, and Dp use site pattern frequencies to identify hybridization events. We then use simulated data to address questions of method accuracy and ideal use scenarios by testing methods against complex networks which depict gene flow events that differ in depth (timing), quantity (single vs. multiple, overlapping hybridizations), and rate of gene flow. We find that deeper or multiple hybridization events may introduce noise and weaken the signal of hybridization, leading to higher false negative rates across methods. Despite some forms of hybridization eluding quartet-based detection methods, MSCquartets displays high precision in most scenarios. While HyDe results in high false negative rates when tested on hybridizations involving ghost lineages, HyDe is the only method to be able to separate hybrid vs parent signals. Lastly, we test the methods on ultraconserved elements from the bee subfamily Nomiinae, finding the possibility of hybridization events between clades which correspond to regions of poor support in the species tree estimated in the original study

    Advancing Divide-And-Conquer Phylogeny Estimation Using Robinson-Foulds Supertrees

    Get PDF
    One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a "supertree method". Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. We also present GreedyRFS (a greedy heuristic that operates by repeatedly using Exact-RFS-2 on pairs of trees, until all the trees are merged into a single supertree). We evaluate Exact-RFS-2 and GreedyRFS, and show that they have better accuracy than the current leading heuristic for RFS

    Analysis of shared heritability in common disorders of the brain

    Get PDF
    ience, this issue p. eaap8757 Structured Abstract INTRODUCTION Brain disorders may exhibit shared symptoms and substantial epidemiological comorbidity, inciting debate about their etiologic overlap. However, detailed study of phenotypes with different ages of onset, severity, and presentation poses a considerable challenge. Recently developed heritability methods allow us to accurately measure correlation of genome-wide common variant risk between two phenotypes from pools of different individuals and assess how connected they, or at least their genetic risks, are on the genomic level. We used genome-wide association data for 265,218 patients and 784,643 control participants, as well as 17 phenotypes from a total of 1,191,588 individuals, to quantify the degree of overlap for genetic risk factors of 25 common brain disorders. RATIONALE Over the past century, the classification of brain disorders has evolved to reflect the medical and scientific communities' assessments of the presumed root causes of clinical phenomena such as behavioral change, loss of motor function, or alterations of consciousness. Directly observable phenomena (such as the presence of emboli, protein tangles, or unusual electrical activity patterns) generally define and separate neurological disorders from psychiatric disorders. Understanding the genetic underpinnings and categorical distinctions for brain disorders and related phenotypes may inform the search for their biological mechanisms. RESULTS Common variant risk for psychiatric disorders was shown to correlate significantly, especially among attention deficit hyperactivity disorder (ADHD), bipolar disorder, major depressive disorder (MDD), and schizophrenia. By contrast, neurological disorders appear more distinct from one another and from the psychiatric disorders, except for migraine, which was significantly correlated to ADHD, MDD, and Tourette syndrome. We demonstrate that, in the general population, the personality trait neuroticism is significantly correlated with almost every psychiatric disorder and migraine. We also identify significant genetic sharing between disorders and early life cognitive measures (e.g., years of education and college attainment) in the general population, demonstrating positive correlation with several psychiatric disorders (e.g., anorexia nervosa and bipolar disorder) and negative correlation with several neurological phenotypes (e.g., Alzheimer's disease and ischemic stroke), even though the latter are considered to result from specific processes that occur later in life. Extensive simulations were also performed to inform how statistical power, diagnostic misclassification, and phenotypic heterogeneity influence genetic correlations. CONCLUSION The high degree of genetic correlation among many of the psychiatric disorders adds further evidence that their current clinical boundaries do not reflect distinct underlying pathogenic processes, at least on the genetic level. This suggests a deeply interconnected nature for psychiatric disorders, in contrast to neurological disorders, and underscores the need to refine psychiatric diagnostics. Genetically informed analyses may provide important "scaffolding" to support such restructuring of psychiatric nosology, which likely requires incorporating many levels of information. By contrast, we find limited evidence for widespread common genetic risk sharing among neurological disorders or across neurological and psychiatric disorders. We show that both psychiatric and neurological disorders have robust correlations with cognitive and personality measures. Further study is needed to evaluate whether overlapping genetic contributions to psychiatric pathology may influence treatment choices. Ultimately, such developments may pave the way toward reduced heterogeneity and improved diagnosis and treatment of psychiatric disorders

    Lick Observatory Supernova Search Follow-Up Program: Photometry Data Release of 93 Type Ia Supernovae

    Full text link
    We present BVRI and unfiltered light curves of 93 Type Ia supernovae (SNe Ia) from the Lick Observatory Supernova Search (LOSS) follow-up program conducted between 2005 and 2018. Our sample consists of 78 spectroscopically normal SNe Ia, with the remainder divided between distinct subclasses (three SN 1991bg-like, three SN 1991T-like, four SNe Iax, two peculiar, and three super-Chandrasekhar events), and has a median redshift of 0.0192. The SNe in our sample have a median coverage of 16 photometric epochs at a cadence of 5.4 days, and the median first observed epoch is ~4.6 days before maximum B-band light. We describe how the SNe in our sample are discovered, observed, and processed, and we compare the results from our newly developed automated photometry pipeline to those from the previous processing pipeline used by LOSS. After investigating potential biases, we derive a final systematic uncertainty of 0.03 mag in BVRI for our dataset. We perform an analysis of our light curves with particular focus on using template fitting to measure the parameters that are useful in standardising SNe Ia as distance indicators. All of the data are available to the community, and we encourage future studies to incorporate our light curves in their analyses.Comment: 29 pages, 13 figures, accepted for publication in MNRA

    GWAS meta-analysis of over 29,000 people with epilepsy identifies 26 risk loci and subtype-specific genetic architecture

    Get PDF
    Epilepsy is a highly heritable disorder affecting over 50 million people worldwide, of which about one-third are resistant to current treatments. Here we report a multi-ancestry genome-wide association study including 29,944 cases, stratified into three broad categories and seven subtypes of epilepsy, and 52,538 controls. We identify 26 genome-wide significant loci, 19 of which are specific to genetic generalized epilepsy (GGE). We implicate 29 likely causal genes underlying these 26 loci. SNP-based heritability analyses show that common variants explain between 39.6% and 90% of genetic risk for GGE and its subtypes. Subtype analysis revealed markedly different genetic architectures between focal and generalized epilepsies. Gene-set analyses of GGE signals implicate synaptic processes in both excitatory and inhibitory neurons in the brain. Prioritized candidate genes overlap with monogenic epilepsy genes and with targets of current antiseizure medications. Finally, we leverage our results to identify alternate drugs with predicted efficacy if repurposed for epilepsy treatment
    corecore